Project Home

Load packages used to generate figures on this page:

# General data wrangling
library(tidyverse)
library(knitr)
library(kableExtra)
library(DT)
library(lubridate)
library(readxl)

# Visualization
library(plotly)
library(forcats)
library(ggsignif)

# ggseg is used to visualize the brain
# remotes::install_github("LCBC-UiO/ggseg")
# If that doesn't work: 
# download.file("https://github.com/LCBC-UiO/ggseg/archive/master.zip", "ggseg.zip")
# unzip("ggseg.zip")
# devtools::install_local("ggseg-master")
library(ggseg)

# remotes::install_github("LCBC-UiO/ggseg3d")
library(ggseg3d)

# remotes::install_github("LCBC-UiO/ggsegExtra")
library(ggsegExtra)

Tau-PET Data

Data overview

The longitudinal tau-PET dataset was downloaded as a CSV from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) Study Data repository located at Study Data/Imaging/PET Image Analysis/UC Berkeley - AV1451 Analysis [ADNI2,3] (version: 5/12/2020). This CSV file contains 1,121 rows and 241 columns. Note:ADNI data is freely accessible to all registered users. Please see my Acknowledgments page for more information about ADNI and its contributors.

Data loading

Load partial volume corrected regional tau-PET data, as downloaded from ADNI:

tau.df <- read.csv("../../ADNI_Data/Raw_Data/UCBERKELEYAV1451_PVC_05_12_20.csv")
tau.df$EXAMDATE = as.Date(tau.df$EXAMDATE, format="%m/%d/%Y")

Each row in the CSV represents one tau-PET scan (see str call below). Ssome subjects had repeated scans separated by approximately one year, while other subjects had only one scan. Columns include subject information including anonymized subject ID, visit code, and PET exam date. The other columns encode regional volume and tau-PET uptake. Specifically, there are 123 distinct cortical and subcortical regions of interest (ROIs), each of which has a volume field (in mm^3) and a tau-PET uptake field, called the Standardized Uptake Value Ratio (SUVR).

str(tau.df)
## 'data.frame':    1120 obs. of  165 variables:
##  $ RID                                   : int  21 31 31 56 56 56 59 69 69 69 ...
##  $ VISCODE                               : chr  "init" "init" "y1" "init" ...
##  $ VISCODE2                              : chr  "m144" "m144" "m156" "m144" ...
##  $ EXAMDATE                              : Date, format: "2018-02-02" "2018-04-24" "2019-04-23" ...
##  $ INFERIOR_CEREBGM_SUVR                 : num  1.32 1.33 1.33 1.28 1.24 ...
##  $ INFERIOR_CEREBGM_VOLUME               : int  52306 54296 54296 56750 56750 56750 59836 56862 56862 56862 ...
##  $ HEMIWM_SUVR                           : num  1.02 0.85 0.866 1.138 1.196 ...
##  $ HEMIWM_VOLUME                         : int  321220 281690 281690 336495 336495 336495 294422 463900 463900 463900 ...
##  $ BRAAK12_SUVR                          : num  2.06 2.24 2.3 1.91 1.88 ...
##  $ BRAAK12_VOLUME                        : int  10275 7587 7587 9376 9376 9376 10379 10981 10981 10981 ...
##  $ BRAAK34_SUVR                          : num  1.95 1.87 1.8 1.82 1.77 ...
##  $ BRAAK34_VOLUME                        : int  95661 95419 95419 92482 92482 92482 94092 112788 112788 112788 ...
##  $ BRAAK56_SUVR                          : num  1.99 1.92 1.84 1.87 1.84 ...
##  $ BRAAK56_VOLUME                        : int  284821 288136 288136 283119 283119 283119 283727 325054 325054 325054 ...
##  $ BRAIN_STEM_SUVR                       : num  1.27 1.12 1.12 1.2 1.17 ...
##  $ BRAIN_STEM_VOLUME                     : int  16955 16952 16952 20508 20508 20492 18057 18872 18872 18866 ...
##  $ LEFT_MIDDLEFR_SUVR                    : num  2.02 1.93 1.8 1.83 1.78 ...
##  $ LEFT_MIDDLEFR_VOLUME                  : int  17640 18517 18517 17164 17164 17164 17683 21907 21907 21907 ...
##  $ LEFT_ORBITOFR_SUVR                    : num  2.17 2.03 1.92 2.11 1.98 ...
##  $ LEFT_ORBITOFR_VOLUME                  : int  11676 10091 10091 11721 11721 11721 10917 12109 12109 12109 ...
##  $ LEFT_PARSFR_SUVR                      : num  2.02 2.01 1.98 2.03 1.99 ...
##  $ LEFT_PARSFR_VOLUME                    : int  9201 7799 7799 9185 9185 9185 7709 9813 9813 9813 ...
##  $ LEFT_ACCUMBENS_AREA_SUVR              : num  1.14 1.04 1.79 1.12 1.18 ...
##  $ LEFT_ACCUMBENS_AREA_VOLUME            : int  500 318 318 308 308 308 353 361 361 361 ...
##  $ LEFT_AMYGDALA_SUVR                    : num  1.31 1.54 1.63 1.42 1.37 ...
##  $ LEFT_AMYGDALA_VOLUME                  : int  1367 1224 1224 1561 1561 1561 993 1499 1499 1499 ...
##  $ LEFT_CAUDATE_SUVR                     : num  2.08 1.46 1.34 1.95 1.83 ...
##  $ LEFT_CAUDATE_VOLUME                   : int  3016 4890 4890 3083 3083 3083 2874 4049 4049 4049 ...
##  $ LEFT_HIPPOCAMPUS_SUVR                 : num  2.12 1.96 2.2 1.69 1.73 ...
##  $ LEFT_HIPPOCAMPUS_VOLUME               : int  3822 3050 3050 3476 3476 3476 3603 3550 3550 3550 ...
##  $ LEFT_PALLIDUM_SUVR                    : num  3.79 1.89 1.95 2.5 2.6 ...
##  $ LEFT_PALLIDUM_VOLUME                  : int  444 2066 2066 1301 1301 1301 1081 1634 1634 1634 ...
##  $ LEFT_PUTAMEN_SUVR                     : num  1.69 1.64 1.42 1.9 1.78 ...
##  $ LEFT_PUTAMEN_VOLUME                   : int  4000 5675 5675 4832 4832 4832 3563 4891 4891 4891 ...
##  $ LEFT_THALAMUS_PROPER_SUVR             : num  1.45 1.32 1.24 1.54 1.53 ...
##  $ LEFT_THALAMUS_PROPER_VOLUME           : int  8226 6195 6195 7114 7114 7114 7561 7518 7518 7518 ...
##  $ RIGHT_MIDDLEFR_SUVR                   : num  2.08 1.91 1.8 1.94 1.85 ...
##  $ RIGHT_MIDDLEFR_VOLUME                 : int  17250 18440 18440 15605 15605 15605 16280 22586 22586 22586 ...
##  $ RIGHT_ORBITOFR_SUVR                   : num  2.19 2.01 1.86 2.17 2.03 ...
##  $ RIGHT_ORBITOFR_VOLUME                 : int  11614 12637 12637 11064 11064 11064 11537 12575 12575 12575 ...
##  $ RIGHT_PARSFR_SUVR                     : num  2.17 2.08 1.9 2.09 2.01 ...
##  $ RIGHT_PARSFR_VOLUME                   : int  9255 8131 8131 9641 9641 9641 8839 9119 9119 9119 ...
##  $ RIGHT_ACCUMBENS_AREA_SUVR             : num  1.41 1.65 1.66 1.01 1.07 ...
##  $ RIGHT_ACCUMBENS_AREA_VOLUME           : int  545 413 413 423 423 423 542 528 528 528 ...
##  $ RIGHT_AMYGDALA_SUVR                   : num  1.18 1.79 1.89 1.37 1.44 ...
##  $ RIGHT_AMYGDALA_VOLUME                 : int  1268 1028 1028 1464 1464 1464 1313 1797 1797 1797 ...
##  $ RIGHT_CAUDATE_SUVR                    : num  2.01 1.57 1.37 1.96 1.89 ...
##  $ RIGHT_CAUDATE_VOLUME                  : int  3179 4854 4854 2984 2984 2984 3224 3835 3835 3835 ...
##  $ RIGHT_HIPPOCAMPUS_SUVR                : num  2.01 2.09 2.03 1.62 1.64 ...
##  $ RIGHT_HIPPOCAMPUS_VOLUME              : int  3978 2723 2723 3489 3489 3489 3667 3942 3942 3942 ...
##  $ RIGHT_PALLIDUM_SUVR                   : num  3.01 2.32 2.12 2.33 2.48 ...
##  $ RIGHT_PALLIDUM_VOLUME                 : int  846 1531 1531 1262 1262 1262 1088 1552 1552 1552 ...
##  $ RIGHT_PUTAMEN_SUVR                    : num  1.68 1.62 1.53 2.06 1.94 ...
##  $ RIGHT_PUTAMEN_VOLUME                  : int  4322 5774 5774 4328 4328 4328 3190 4569 4569 4569 ...
##  $ RIGHT_THALAMUS_PROPER_SUVR            : num  1.42 1.33 1.24 1.52 1.55 ...
##  $ RIGHT_THALAMUS_PROPER_VOLUME          : int  5968 5442 5442 5940 5940 5940 6257 7899 7899 7899 ...
##  $ CHOROID_SUVR                          : num  7.45 4.56 4.31 3.84 3.79 ...
##  $ CHOROID_VOLUME                        : int  4180 3591 3591 3165 3165 3165 3717 3663 3663 3663 ...
##  $ CTX_LH_BANKSSTS_SUVR                  : num  1.75 1.49 1.6 1.7 1.63 ...
##  $ CTX_LH_BANKSSTS_VOLUME                : int  1553 1633 1633 1812 1812 1812 1694 2601 2601 2601 ...
##  $ CTX_LH_CAUDALANTERIORCINGULATE_SUVR   : num  1.67 1.73 1.65 1.69 1.69 ...
##  $ CTX_LH_CAUDALANTERIORCINGULATE_VOLUME : int  1138 1387 1387 1124 1124 1124 1465 1512 1512 1512 ...
##  $ CTX_LH_CUNEUS_SUVR                    : num  2.33 2.2 2.05 2.01 2 ...
##  $ CTX_LH_CUNEUS_VOLUME                  : int  2023 2702 2702 2429 2429 2429 2393 2222 2222 2222 ...
##  $ CTX_LH_ENTORHINAL_SUVR                : num  2.07 2.3 2.43 2.79 2.52 ...
##  $ CTX_LH_ENTORHINAL_VOLUME              : int  1468 1035 1035 1068 1068 1068 1297 1888 1888 1888 ...
##  $ CTX_LH_FUSIFORM_SUVR                  : num  1.97 1.87 1.83 1.84 1.77 ...
##  $ CTX_LH_FUSIFORM_VOLUME                : int  7956 6997 6997 7694 7694 7694 7807 9083 9083 9083 ...
##  $ CTX_LH_INFERIORPARIETAL_SUVR          : num  1.99 1.95 1.94 1.85 1.89 ...
##  $ CTX_LH_INFERIORPARIETAL_VOLUME        : int  11656 10174 10174 9243 9243 9243 8180 9846 9846 9846 ...
##  $ CTX_LH_INFERIORTEMPORAL_SUVR          : num  2.16 1.97 2.05 2.1 2.02 ...
##  $ CTX_LH_INFERIORTEMPORAL_VOLUME        : int  6606 6418 6418 7286 7286 7286 6869 9599 9599 9599 ...
##  $ CTX_LH_INSULA_SUVR                    : num  1.51 1.64 1.65 1.51 1.48 ...
##  $ CTX_LH_INSULA_VOLUME                  : int  6711 4654 4654 6003 6003 6003 5513 6597 6597 6597 ...
##  $ CTX_LH_ISTHMUSCINGULATE_SUVR          : num  1.9 1.81 1.82 1.79 1.94 ...
##  $ CTX_LH_ISTHMUSCINGULATE_VOLUME        : int  2283 2215 2215 1549 1549 1549 1944 2264 2264 2264 ...
##  $ CTX_LH_LATERALOCCIPITAL_SUVR          : num  2.39 2.06 1.99 1.92 2 ...
##  $ CTX_LH_LATERALOCCIPITAL_VOLUME        : int  8532 10148 10148 8292 8292 8292 10612 9404 9404 9404 ...
##  $ CTX_LH_LINGUAL_SUVR                   : num  2.27 1.95 1.97 1.76 1.74 ...
##  $ CTX_LH_LINGUAL_VOLUME                 : int  4329 4658 4658 5606 5606 5606 5435 6488 6488 6488 ...
##  $ CTX_LH_MIDDLETEMPORAL_SUVR            : num  2.2 2.06 1.89 2.04 1.99 ...
##  $ CTX_LH_MIDDLETEMPORAL_VOLUME          : int  7445 8322 8322 7292 7292 7292 8031 9467 9467 9467 ...
##  $ CTX_LH_PARACENTRAL_SUVR               : num  1.99 1.79 1.8 1.91 1.8 ...
##  $ CTX_LH_PARACENTRAL_VOLUME             : int  2672 2890 2890 3231 3231 3231 3358 3173 3173 3173 ...
##  $ CTX_LH_PARAHIPPOCAMPAL_SUVR           : num  1.6 1.86 1.92 1.72 1.66 ...
##  $ CTX_LH_PARAHIPPOCAMPAL_VOLUME         : int  1659 1549 1549 1900 1900 1900 1989 2296 2296 2296 ...
##  $ CTX_LH_PERICALCARINE_SUVR             : num  2.23 1.45 1.41 1.56 1.54 ...
##  $ CTX_LH_PERICALCARINE_VOLUME           : int  1678 2004 2004 1866 1866 1866 1918 1927 1927 1927 ...
##  $ CTX_LH_POSTCENTRAL_SUVR               : num  2.03 1.81 1.82 1.85 1.78 ...
##  $ CTX_LH_POSTCENTRAL_VOLUME             : int  8281 8428 8428 8275 8275 8275 7580 8976 8976 8976 ...
##  $ CTX_LH_POSTERIORCINGULATE_SUVR        : num  1.82 1.89 1.84 1.72 1.67 ...
##  $ CTX_LH_POSTERIORCINGULATE_VOLUME      : int  2439 2608 2608 2683 2683 2683 2573 2638 2638 2638 ...
##  $ CTX_LH_PRECENTRAL_SUVR                : num  1.91 1.85 1.75 1.62 1.61 ...
##  $ CTX_LH_PRECENTRAL_VOLUME              : int  11174 12349 12349 10924 10924 10924 10820 12307 12307 12307 ...
##  $ CTX_LH_PRECUNEUS_SUVR                 : num  1.93 1.89 1.94 1.81 1.81 ...
##  $ CTX_LH_PRECUNEUS_VOLUME               : int  7870 8313 8313 8387 8387 8387 8311 8584 8584 8584 ...
##  $ CTX_LH_ROSTRALANTERIORCINGULATE_SUVR  : num  1.71 1.58 1.49 1.59 1.48 ...
##  $ CTX_LH_ROSTRALANTERIORCINGULATE_VOLUME: int  2928 2448 2448 1695 1695 1695 2466 2915 2915 2915 ...
##  $ CTX_LH_SUPERIORFRONTAL_SUVR           : num  1.86 1.86 1.74 1.84 1.77 ...
##   [list output truncated]

The SUVR value is normalized to the tau-PET uptake in the inferior cerebellum gray matter (highlighted in blue below), a commonly-used region for tau normalization given the lack of inferior cerebellar tau pathology in Alzheimer’s Disease.

aseg_3d %>% 
  unnest(ggseg_3d) %>% 
  ungroup() %>% 
  select(region) %>% 
  na.omit() %>% 
  mutate(val=ifelse(region %in% c("Right-Cerebellum-Cortex", "Left-Cerebellum-Cortex"), 1, 0)) %>%
  ggseg3d(atlas=aseg_3d, label="region", text="val", colour="val", na.alpha=0.5, 
           palette=c("transparent", "deepskyblue3"), show.legend=F) %>%
  add_glassbrain() %>%
  pan_camera("left lateral") %>%
  remove_axes()

All cortical and subcortical ROIs were delineated by first co-registering the tau-PET image to a high-resolution structural T1-weighted MPRAGE acquired in the same imaging session, and then applying FreeSurfer (v5.3) for automated regional segmentation and parcellation. The following diagram from Marcoux et al. (2018) summarizes this process:

Furthermore, to mitigate issues with lower voxel resolution in PET imaging, partial volume correction was applied to use probabilistic tissue segmentation maps to refine individual ROIs. Note: these PET processing steps were all performed by Susan Landau, Deniz Korman, and William Jagust at the Helen Wills Neuroscience Institute, UC Berkeley and Lawrence Berkeley National Laboratory.

18F-AV-1451 is a relatively recent PET tracer, and was only incorporated into the ADNI-3 pipeline beginning in 2016. I am curious about the temporal distribution of the FreeSurfer-analyzed scans here:

tau.df %>%
  # RID = unique subject identifier, EXAMDATE=date of PET scan
  select(RID, EXAMDATE) %>%
  # Create interactive plotly histogram, which auto-formats date along x-axis
  plot_ly(x=~EXAMDATE, type="histogram",
          marker = list(color = "lightsteelblue",
                        line = list(color = "lightslategray",
                                    width = 1.5))) %>%
  layout(title = 'Tau-PET Scan Date Distribution',
         xaxis = list(title = 'Scan Date',
                      zeroline = TRUE),
         yaxis = list(title = 'Number of PET Scans')) 

Even though ADNI3 officially began in 2016, most scans were acquired from mid-2017 to present day. This doesn’t affect this analysis, though, since the same tau-PET imaging protocol has been used across sites since 2016.

Data distribution

Since this project will explore tau-PET measurements over time, I will be refining the dataset to only subjects with multiple tau-PET scans. Here’s the overall distribution of number of longitudinal scans by subject:

# Plot number of PET scans by subject in a bar chart
p_num_long <- tau.df %>%
  mutate(RID=as.character(RID)) %>%
  group_by(RID) %>%
  summarise(n_scans=n()) %>%
  ggplot(., aes(x=fct_reorder(RID, n_scans, .desc=T), y=n_scans)) +
  geom_bar(stat="identity", aes(fill=n_scans, color=n_scans)) +
  labs(fill="Count", color="Count") +
  ggtitle("Number of Longitudinal PET Scans per Subject") +
  ylab("Number of PET Scans") +
  xlab("Subject") +
  theme(axis.text.x=element_blank(),
        plot.title=element_text(hjust=0.5)) 

# Convert to interactive plotly chart
ggplotly(p_num_long)
rm(p_num_long)

The majority of subjects only had one tau-PET scan; given the longitudinal nature of this project, such subjects will eventually be omitted from analysis. On that note, it’s important to know exactly how many subjects do have at least two tau-PET scans:

# Calculate number of subjects with 2+ PET scans
num_scans <- tau.df %>%
  mutate(RID=as.character(RID)) %>%
  group_by(RID) %>%
  # Tabulate # scans by subject and filter
  summarise(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  ungroup() %>%
  summarise(num_subjects=n(),
            total_scans=sum(n_scans))
# Print results
cat("Number of subjects with at least two scans: **", 
    num_scans$num_subjects, "**\n", 
    "\nNumber of total PET scans: **", 
    num_scans$total_scans, "**\n", sep="")

Number of subjects with at least two scans: 249

Number of total PET scans: 593

So, we have 249 subjects with two or more scans.

Temporal distribution

Another important consideration is the length of time between each consecutive scan. I will eventually normalize changes in tau-PET to number of years passed to yield an annual rate of change, but it’s good to know what that time interval is:

# Plot the # years in between each pair of consecutive tau-PET scans
p_pet_interval <- tau.df %>%
  select(RID, EXAMDATE) %>%
  group_by(RID) %>%
  mutate(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  # Calculate difference between pairs of scan dates using lag
  mutate(Years_between_Scans = 
           as.numeric((EXAMDATE - lag(EXAMDATE, 
                                      default = EXAMDATE[1]))/365)) %>%
  # Omit the first scan, for which the time interval is zero
  filter(Years_between_Scans>0) %>%
  ggplot(., aes(x=Years_between_Scans)) +
  geom_histogram(stat="count", color="lightslategray") +
  ggtitle("Years in between Tau-PET Scans per Subject") +
  ylab("Frequency") +
  xlab("# Years between two consecutive scans for a subject") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5)) 

# Convert to interactive plotly histogram
ggplotly(p_pet_interval)
rm(p_pet_interval)

There’s a cluster of scans around the one-year mark. Presumably, ADNI3 participants are enrolled in an annual tau-PET plan, though in some cases scans aren’t at precise yearly intervals.

Data missingness

I’ll check if there are any missing data points for tau-PET SUVR values for any of the FreeSurfer-derived cortical or subcortical regions of interest (ROIs). Note: this is filtered to show only subjects with at least two scans:

# Calculate number and proportion of missing data records for each measured region of interest (ROI)
tau.df %>%
  select(-VISCODE, -VISCODE2, -update_stamp) %>%
  group_by(RID) %>%
  mutate(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  select(-n_scans) %>%
  ungroup() %>%
  # filter only to the SUVR columns
  select(!matches("VOLUME")) %>%
  # Reshape from wide --> long
  pivot_longer(cols=c(-RID, -EXAMDATE), names_to="ROI", values_to="SUVR") %>%
  mutate(ROI = str_replace(ROI, "_SUVR", "")) %>%
  group_by(ROI) %>%
  # Calculate number and proportion of missing data points for each ROI
  summarise(`Percent Missing` = sum(is.na(SUVR))/n(),
            `Number Missing` = sum(is.na(SUVR))) %>%
  datatable()

All regions have zero missing data points, so no imputation will be necessary.

SUVR Distribution by Region

Now, I’ll check the distribution of tau-PET uptake values across the ROIs.

# plot the mean tau-PET SUVR for each region of interest
p_roi_suvr <- tau.df %>%
  # omit irrelevant variables
  select(-VISCODE, -VISCODE2, -update_stamp) %>%
  group_by(RID) %>%
  mutate(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  select(-n_scans) %>%
  # only look at SUVR columns; reshape wide --> long
  select(!matches("VOLUME")) %>%
  pivot_longer(cols=c(-RID, -EXAMDATE), names_to="ROI", values_to="SUVR") %>%
  mutate(ROI = tolower(str_replace(ROI, "_SUVR", ""))) %>%
  group_by(ROI) %>%
  # calculate mean and SD, along with ymin and ymax, to plot error bars for each ROI
  summarise(Mean_SUVR=mean(SUVR, na.rm=T),
            SD_SUVR = sd(SUVR, na.rm=T),
            ymin = Mean_SUVR-SD_SUVR,
            ymax = Mean_SUVR+SD_SUVR) %>%
  # fct_reorder --> arrange ROI by its average tau-PET SUVR
  ggplot(data=., mapping=aes(x=fct_reorder(ROI, Mean_SUVR, .desc=F), 
                             y=Mean_SUVR,
                             label = ROI)) +
  geom_bar(stat="identity", show.legend=F, fill="lightsteelblue") +
  # Add error bars
  geom_errorbar(aes(ymin=ymin, ymax=ymax), width=0, color="lightslategray") +
  coord_flip() +
  theme_minimal() +
  ylab("Mean Tau-PET SUVR") +
  xlab("Region of Interest") +
  ggtitle("Mean Tau-PET SUVR by ROI") +
  theme(axis.text.x=element_text(size=8, angle=45))

# Convert to interactive plotly graph
ggplotly(p_roi_suvr, height=1000, width=600, tooltip=c("label", "y"))
rm(p_roi_suvr)

ROI Normalization

These values are supposed to be normalized to the inferior cerebellum gray matter, indicated by INFERIOR_CEREBGM_SUVR. To confirm, I’ll check the distribution of INFERIOR_CEREBGM_SUVR values.

# Plot distribution of inferior cerebellum gray tau-PET SUVR
p_inf_cb <- tau.df %>%
  select(-VISCODE, -VISCODE2, -update_stamp) %>%
  group_by(RID) %>%
  mutate(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  select(-n_scans) %>%
  # Select only SUVR columns for ROIs; pivot wide --> long
  select(!matches("VOLUME")) %>%
  pivot_longer(cols=c(-RID, -EXAMDATE), names_to="ROI", values_to="SUVR") %>%
  mutate(ROI = str_replace(ROI, "_SUVR", "")) %>%
  # Filter to inferior cerebellum gray
  filter(ROI=="INFERIOR_CEREBGM") %>%
  ggplot(data=., mapping=aes(x=SUVR)) +
  geom_histogram(aes(y=..count..), fill="lightsteelblue", color="lightslategray") +
  theme_minimal() +
  ylab("Number of Occurences") +
  xlab("Inferior Cerebellum Gray SUVR") +
  ggtitle("Distribution of Inferior Cerebellum Gray Matter Tau Uptake") +
  theme(plot.title=element_text(hjust=0.5))

# Convert to interactive plotly histogram
ggplotly(p_inf_cb)
rm(p_inf_cb)

Most of the inferior cerebellum gray ROIs actually have SUVRs around 1.25. I’ll re-normalize all ROI values to the mean of this distribution in Data Preparation.

Subject demographics + cognitive assessments

Data overview

The longitudinal subject demographic and cognitive assessment dataset was downloaded as a CSV from the ADNI Study Data repository, located at Study Data/Study Info/Key ADNI tables merged into one table. This CSV file contains 14,816 rows and 113 columns. This includes many subjects with multiple follow-up visits. Columns include (anonymized) subject demographic information such as age, sex, race, and marriage status. Note: ADNI data is freely accessible to all registered users. Please see my Acknowledgments page for more information about ADNI and its contributors.

This project will one key target feature in this dataset: Clinical Dementia Rating (CDR) Sum of Boxes. The CDR scale was initially developed in 1982 at the Washington University as a metric of clinical dementia progression (Hughes et al., 1982). This cognitive test measures impairment in six cognitive domains: memory, orientation, judgment and problem solving, community affairs, home and hobbies, and personal care (Morris 1991). Each of these categories is scored on a five-point scale as follows:

  • 0 = no impairment
  • 0.5 = questionable
  • 1 = mild dementia
  • 2 = moderate dementia
  • 3 = severe dementia

The global score is most heavily influenced by the memory score, though there is an established decision tree-like algorithm for how to calculate the global score. By contrast, the CDR Sum of Boxes reflects the sum of scores for each of the six domains, with an overall range of 0 (no impairment) to 18 (severe impairment). The CDR Sum of Boxes is an extension upon the CDR global score, offering a more detailed quantitative score. This metric reportedly improves precision in longitudinal cognitive tracking, particularly in cases of mild dementia (O’Bryant et al., 2008). Of note, the CDR assessment shows high inter-rater reliability, which is important given the inter-site nature of the ADNI cohort (Morris 1991).

Sources:

  • Hughes, C. P., Berg, L., Danziger, W., Coben, L. A., & Martin, R. L. (1982). A new clinical scale for the staging of dementia. The British journal of psychiatry, 140(6), 566-572.
  • Morris, J. C. (1991). The clinical dementia rating (CDR): Current version and scoring rules. Neurology, 43(11), 1588-1592.
  • O’Bryant, S. E., Waring, S. C., Cullum, C. M., Hall, J., Lacritz, L., Massman, P. J., … & Doody, R. (2008). Staging dementia using Clinical Dementia Rating Scale Sum of Boxes scores: a Texas Alzheimer’s research consortium study. Archives of neurology, 65(8), 1091-1095.

Data loading

ADNI compiled a merged dataset containing key information from several tables, including subject demographics, selected cognitive assessment scores, and select biomarker data.

# Read in subject demographic dataset from ADNI
subj.info <- read.csv("../../ADNI_Data/Raw_Data/ADNIMERGE.csv", stringsAsFactors = T, na.strings="")
# Read in file explaining column meanings
subj.info.fields <- read_excel("../../ADNI_Data/Cleaned_Data/adnimerge_dict_mod.xlsx")
Here are the subject demographic dataset features and corresponding descriptions as detailed by ADNI:
datatable(subj.info.fields)

The columns I will be using for this project include:

  • RID: Participant roster ID, which serves as unique subject identifier
  • VISCODE: Visit code
  • EXAMDATE: Date
  • AGE: Age at visit
  • PTGENDER: Biological sex
  • CDRSB: CDR Sum-of-Boxes score at visit
subj.info$EXAMDATE <- as.Date(subj.info$EXAMDATE, format="%m/%d/%Y")
summary(subj.info %>% select(RID, VISCODE, EXAMDATE, AGE, PTGENDER, CDRSB))
##       RID          VISCODE        EXAMDATE               AGE          PTGENDER   
##  Min.   :   2   bl     :2272   Min.   :2005-09-07   Min.   :54.40   Female:6593  
##  1st Qu.: 685   m12    :1836   1st Qu.:2008-12-23   1st Qu.:69.10   Male  :8223  
##  Median :2074   m06    :1618   Median :2012-06-19   Median :73.50                
##  Mean   :2613   m24    :1451   Mean   :2012-05-02   Mean   :73.48                
##  3rd Qu.:4513   m18    :1292   3rd Qu.:2014-03-31   3rd Qu.:78.30                
##  Max.   :6874   m36    : 857   Max.   :2020-07-27   Max.   :91.40                
##                 (Other):5490                        NA's   :4                    
##      CDRSB       
##  Min.   : 0.000  
##  1st Qu.: 0.000  
##  Median : 1.000  
##  Mean   : 2.083  
##  3rd Qu.: 3.000  
##  Max.   :18.000  
##  NA's   :4172

There is a lot of missing data in this dataset – however, this dataset includes many subjects and visit dates that don’t correspond to tau-PET scans, and therefore won’t be used in this analysis. Missingness will be re-evaluated once the PET data and subject demographic data is merged in Data Preparation.

Data distribution

The time distribution of ADNI cognitive assessment data can be visualized:
# plotly histogram of cognitive assessment dates
subj.info %>%
  select(RID, EXAMDATE) %>%
  plot_ly(x=~EXAMDATE, type="histogram",
          marker = list(color = "lightsteelblue",
                        line = list(color = "lightslategray",
                                    width = 1.5))) %>%
  layout(title = 'Subject Demographics Date Distribution',
         xaxis = list(title = 'Visit Date',
                      zeroline = TRUE),
         yaxis = list(title = 'Number of Subjects')) 

One thing to note is that tau-PET was only incorporated into the ADNI pipeline in late 2015/early 2016, so any demographic information from pre-2016 will not be included in modeling and analysis. Let’s check how many subjects had more than one visit recorded in this dataset:

# visualize distribution of # cognitive assessments per subject
p_subj_long <- subj.info %>%
  mutate(RID=as.character(RID)) %>%
  group_by(RID) %>%
  summarise(n_exams=n()) %>%
  # fct_reorder --> arrange subject on x-axis by number of exams, from large to small
  ggplot(., aes(x=fct_reorder(RID, n_exams, .desc=T), y=n_exams, label=RID)) +
  geom_bar(stat="identity", aes(fill=n_exams, color=n_exams)) +
  labs(fill="Count", color="Count") +
  ggtitle("Number of ADNI Visits per Subject") +
  ylab("Number of Visits") +
  xlab("Subjects") +
  theme(axis.text.x=element_blank(),
        plot.title=element_text(hjust=0.5)) 

# convert to interactive plotly histogram
ggplotly(p_subj_long, tooltip = c("y"))
rm(p_subj_long)

Unlike with the PET data, most subjects have two or more visits recorded with cognitive and demographic information. The subjects in this dataset have up to 21 CDR-Sum of Boxes scores. To examine precisely how many subjects have at least two visits recorded:

num.subj <- subj.info %>%
  mutate(RID=as.character(RID)) %>%
  group_by(RID) %>%
  summarise(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() %>%
  summarise(num_subjects=n(),
            total_exams=sum(n_exams))
cat("Number of subjects with at least two ADNI visits: **", 
    num.subj$num_subjects, "**\n", 
    "\nTotal number of longitudinal ADNI visits recorded: **", 
    num.subj$total_exams, "**\n", sep="")

Number of subjects with at least two ADNI visits: 2027

Total number of longitudinal ADNI visits recorded: 14571

There are 2027 subjects with two or more ADNI visits in this dataset. This should include all subjects and visit dates included in the tau-PET dataset, which will be confirmed upon merging the datasets in the Data Preparation stage.

Temporal distribution

It’s also worth checking the distribution of time interval between ADNI visits in these 2027 subjects with two or more visits:

# Plot # years between consecutive cognitive assessments by subject
p.subj.interval <- subj.info %>%
  select(RID, EXAMDATE) %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  arrange(EXAMDATE) %>%
  # Use lag to calculate time interval between exams
  mutate(Years_between_ADNI = 
           as.numeric((EXAMDATE - lag(EXAMDATE, 
                                      default = EXAMDATE[1]))/365)) %>%
  # Omit first scan per subject, for which the time interval is zero
  filter(Years_between_ADNI>0) %>%
  ggplot(., aes(x=Years_between_ADNI)) +
  geom_histogram(stat="count", fill="lightsteelblue", color="lightslategray") +
  ggtitle("Years in between ADNI visits per Subject") +
  ylab("Frequency") +
  xlab("# Years between two consecutive ADNI visits") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5)) 

# Convert to plotly interactive histogram
ggplotly(p.subj.interval)
rm(p.subj.interval)

Interestingly, there is a clear peak around 0.5 (six months) and a smaller peak around 1 (one year), indicating that most visits were spaced between 6 and 12 months apart. However, there is a positive skew to this distribution showing that a portion of subjects went up to five years in between visits. This is not likely to affect my analysis, as most tau-PET subjects had PET scans from 2018 onward, and would therefore have a follow-up interval of two years or less.

CDR-Sum of Boxes Scores

Now, I’ll look into the distribution of the target variable (CDRSB) and how they relate to other covariates, namely age and sex. These visualizations are filtered to show only those subjects with 2+ assessments.

# Plot CDR-SoB scores distribution across subjects with 2+ assessments
p.cdr.scores <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ggplot(data=., mapping=aes(x=CDRSB)) +
  geom_histogram(aes(y=..count..), fill="lightsteelblue", color="lightslategray") +
  theme_minimal() +
  ylab("# of Occurences") +
  xlab("CDR-Sum of Boxes") +
  ggtitle("Clinical Dementia Rating (CDR) Sum of Boxes Distribution") +
  theme(plot.title=element_text(hjust=0.5))

# Convert to plotly interactive histogram
ggplotly(p.cdr.scores)
rm(p.cdr.scores)

CDR-SoB by Age + Sex

Now, to stratify by age and sex, respectively:

# Violin plot by sex for CDR-SoB
p.cdr.sex.violin <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ggplot(data=., mapping=aes(x=PTGENDER, y=CDRSB)) +
  geom_violin(aes(fill=PTGENDER)) +
  theme_minimal() +
  ylab("CDR Sum of Boxes Score") +
  xlab("Biological Sex") +
  geom_signif(map_signif_level = F,
              test="t.test",
              comparisons=list(c("Female", "Male"))) +
  ggtitle("Clinical Dementia Rating (CDR) Sum of Boxes by Sex") +
  theme(plot.title=element_text(hjust=0.5),
        legend.position="none")

# CDR-SoB histogram by sex
p.cdr.sex.hist <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ggplot(data=., mapping=aes(x=CDRSB)) +
  geom_histogram(aes(y=..count.., fill=PTGENDER)) +
  facet_wrap(PTGENDER ~ ., ncol=1) +
  theme_minimal() +
  ylab("Number of Subjects") +
  xlab("CDR Sum of Boxes Score") +
  ggtitle("Clinical Dementia Rating (CDR) Sum of Boxes Distribution by Sex") +
  theme(plot.title=element_text(hjust=0.5),
        legend.position="none")


# Convert plots to interactive plotly visualizations
p.cdr.sex.violin <- ggplotly(p.cdr.sex.violin)
p.cdr.sex.hist <- ggplotly(p.cdr.sex.hist)

# Create plotly layout using subplot, keep x and y axes distinct
plotly::subplot(p.cdr.sex.violin, p.cdr.sex.hist, shareX=F, shareY=F,
                titleX=T, titleY=T) %>% 
  # Manually supply x- and y-axis titles
  layout(xaxis = list(title = "Biological Sex", 
                      titlefont = list(size = 12)), 
         xaxis2 = list(title = "CDR Sum of Boxes Score", 
                       titlefont = list(size = 12)),
         yaxis=list(title="CDR Sum of Boxes Score", 
                    titlefont = list(size = 12)),
         yaxis2 = list(title="Number of Subjects", 
                       titlefont = list(size = 12)),
         yaxis3 = list(title="Number of Subjects", 
                       titlefont = list(size = 12)))

rm(p.cdr.sex.hist, p.cdr.sex.violin)

The distribution of CDR Sum of Boxes score looks similar between Females and Males by eye, but I’ll follow up with a t-test to confirm:

# t-test for CDR-SoB by sex
t.test.df <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() 
t.test(CDRSB ~ PTGENDER, data=t.test.df)
rm(t.test.df)
## 
##  Welch Two Sample t-test
## 
## data:  CDRSB by PTGENDER
## t = -2.804, df = 9542.1, p-value = 0.005057
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.27228130 -0.04822416
## sample estimates:
## mean in group Female   mean in group Male 
##             2.000650             2.160903

In fact, there is actually a statistically significant (p<0.05) difference in CDR Sum of Boxes scores between males and females, with male subjects exhibiting an average score ~0.15 points higher than female subjects. This is an important consideration, and I will be sure to include sex as a covariate in prediction models where applicable.

Similarly, I’ll compare CDR Sum of Boxes with age:

# Plot CDR-SoB by age in scatter plot, only for subjects with 2+ cognitive assessments
p.age.cdr <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() %>%
  ggplot(data=., mapping=aes(x=AGE, y=CDRSB)) +
  labs(color="Number of Points") +
  xlab("Age at Visit") +
  ylab("CDR Sum of Boxes") +
  ggtitle("CDR Sum of Boxes vs. Age at Visit") +
  geom_point(size=1, alpha=0.2, color="lightslategray", fill="lightslategray") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5),
        legend.position="none")

# Plot histogram of age distribution for subjects with 2+ cognitive assessments
p.age.dist <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() %>%
  ggplot(data=., mapping=aes(x=AGE)) +
  xlab("Number of Occurrences") +
  ylab("Age at Visit") +
  geom_histogram(aes(y=..count..), fill="lightsteelblue", color="lightslategray") +
  theme_minimal() +
  ggtitle("CDR Sum of Boxes vs. Age at Visit") +
  theme(plot.title=element_text(hjust=0.5))

p.age.cdr <- ggplotly(p.age.cdr)
p.age.dist <- ggplotly(p.age.dist)
# Create plotly layout using subplot, keep x and y axes distinct
plotly::subplot(p.age.cdr, p.age.dist, shareX=F, shareY=F,
                titleX=T, titleY=T, margin = 0.05) %>% 
  # Manually supply x- and y-axis titles
  layout(xaxis = list(title = "Biological Sex", 
                      titlefont = list(size = 12)), 
         xaxis2 = list(title = "Age at Visit", 
                       titlefont = list(size = 12)),
         yaxis=list(title="CDR Sum of Boxes Score", 
                    titlefont = list(size = 12)),
         yaxis2 = list(title="Number of Subjects", 
                       titlefont = list(size = 12)),
         autosize = F, width = 800, height = 500)

rm(p.age.cdr, p.age.dist)

While there doesn’t appear to be any clear linear association between age at visit and CDR Sum of Boxes score, I’ll use cor.test to calculate the Pearson correlation coefficient and the corresponding p-value based on the correlation coefficient t-statistic:

# Correlation test between age and CDR-SoB for subjects with 2+ cognitive assessments
cor.test.df <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() 
cor.test(cor.test.df$AGE, cor.test.df$CDRSB)
## 
##  Pearson's product-moment correlation
## 
## data:  cor.test.df$AGE and cor.test.df$CDRSB
## t = 6.8375, df = 10395, p-value = 8.512e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04775209 0.08602444
## sample estimates:
##        cor 
## 0.06691288

Interestingly, this correlation coefficient is statistically significant (p=8.512e-12), but the effect size is very small (r=0.067). This significance seems to reflect the sheer size of the dataset rather than a strong relationship between age and CDR sum of boxes scores. Nevertheless, I will explore the use of age as a covariate in modeling later as this is a common practice.

Outlier detection

To better identify outliers based on this multivariate dataset, I will calculate Cook’s distance for each data point once the tau-PET data is merged with the cognitive status data in the Data Preparation stage.

Previous page: Business Understanding Next page: Data Preparation